Prepare Data

On this page, we introduce the functions provided for loading datasets and splitting data.

Load Data

In s3l.datasets.base, we provide some useful functions to load data. Here is the list:

'load_data',
'load_dataset',
'load_graph',
'load_boston',
'load_diabetes',
'load_digits',
'load_iris',
'load_breast_cancer',
'load_linnerud',
'load_wine',
'load_ionosphere',
'load_australian',
'load_bupa',
'load_haberman',
'load_vehicle',
'load_covtype',
'load_housing10',
'load_spambase',
'load_house',
'load_clean1'

Among them, load_data, load_dataset and load_graph load data that you prepare yourself. The other functions load built-in datasets commonly used by researchers and return the data in a form that estimators can use directly. For example,

X, y = load_XXX(return_X_y=True)
# XXX is the name of the dataset

We’ll show you how to use the three user-oriented functions load_data, load_dataset and load_graph. load_dataset is called directly in the experiment classes; you can use these functions when you try algorithms outside an experiment class or when you’re implementing your own experiment class.

load_data loads features and labels of a dataset given the file names.

X, y = load_data(feature_file, label_file)

load_dataset wraps load_data with an additional parameter, name, and loads the built-in dataset instead if name matches one.

X, y = load_dataset(name, feature_file, label_file)
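
The dispatch described above (use a built-in dataset when the name matches, otherwise read the given files) can be sketched in plain Python. This is only an illustration, not the library's implementation: load_dataset_sketch, BUILTIN_LOADERS, the toy dataset and the CSV file layout are all assumptions made for the example.

```python
import csv

def _toy_builtin():
    # Hypothetical built-in dataset: two samples, two features, two labels.
    return [[0.0, 1.0], [1.0, 0.0]], [0, 1]

# Hypothetical registry of built-in loaders, keyed by dataset name.
BUILTIN_LOADERS = {"toy": _toy_builtin}

def read_csv_rows(path):
    """Read a CSV file into a list of float rows (assumed file format)."""
    with open(path, newline="") as handle:
        return [[float(value) for value in row]
                for row in csv.reader(handle) if row]

def load_dataset_sketch(name, feature_file=None, label_file=None):
    """Return (X, y) from a built-in loader if `name` matches, else from files."""
    if name in BUILTIN_LOADERS:
        return BUILTIN_LOADERS[name]()
    if feature_file is None or label_file is None:
        raise ValueError("unknown dataset name and no files given")
    X = read_csv_rows(feature_file)
    # Assumption: the label file holds one integer label per line.
    y = [int(row[0]) for row in read_csv_rows(label_file)]
    return X, y
```

With this sketch, load_dataset_sketch("toy") ignores the file arguments, while any unregistered name falls back to the two files.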

load_graph loads a graph from a *.csv, *.npz or *.mat file and returns it as a matrix.

W = load_graph(graph_file)

Split Data

In s3l.datasets.data_manipulate, we provide some useful functions to split data. Here is the list:

'inductive_split',
'ratio_split',
'cv_split'

Among them, inductive_split can split the dataset into three parts: labeled set, unlabeled set and testing set, which is helpful for semi-supervised learning tasks.

from sklearn.datasets import make_classification
from s3l.datasets import data_manipulate

X, y = make_classification()
train_idx, test_idx, label_idx, unlabel_idx = data_manipulate.inductive_split(
    X, y, test_ratio=0.3, initial_label_rate=0.05, split_count=10)
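
To make the three-way split concrete, here is a minimal standard-library sketch of the idea. It is not the code behind inductive_split: the function name, the seed parameter, the rounding, and the choice to apply the label rate to the training part are assumptions for illustration.

```python
import random

def inductive_split_sketch(n_samples, test_ratio=0.3, initial_label_rate=0.05,
                           split_count=10, seed=0):
    """Return four lists, each holding one index list per random split."""
    rng = random.Random(seed)
    train_idx, test_idx, label_idx, unlabel_idx = [], [], [], []
    n_test = int(round(n_samples * test_ratio))
    for _ in range(split_count):
        order = list(range(n_samples))
        rng.shuffle(order)
        test, train = order[:n_test], order[n_test:]
        # Assumption: the label rate applies to the training part,
        # and at least one sample is always labeled.
        n_label = max(1, int(round(len(train) * initial_label_rate)))
        train_idx.append(train)
        test_idx.append(test)
        label_idx.append(train[:n_label])    # labeled part of the training set
        unlabel_idx.append(train[n_label:])  # unlabeled part of the training set
    return train_idx, test_idx, label_idx, unlabel_idx
```

In each split the labeled and unlabeled indices together make up the training set, and the test set is disjoint from both.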

ratio_split splits the given data by a train/test ratio, and cv_split splits it into k folds for cross-validation.

from sklearn.datasets import make_classification
from s3l.datasets import data_manipulate

X, y = make_classification()
# ratio_split
train_idx, test_idx = data_manipulate.ratio_split(
    X, y, unlabel_ratio=0.3, split_count=10)

# cv_split
train_idx, test_idx = data_manipulate.cv_split(X, y, k=3, split_count=10)
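
For readers unfamiliar with k-fold splitting, the sketch below shows one shuffled k-fold partition of sample indices using only the standard library. It is an illustration under assumptions, not what cv_split does internally: kfold_indices_sketch and its seed parameter are hypothetical, and split_count is left out.

```python
import random

def kfold_indices_sketch(n_samples, k=3, seed=0):
    """Return a list of (train, test) index lists, one pair per fold."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [order[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        # The training indices are all folds except the held-out one.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits
```

Every sample appears in exactly one test fold, and each training list is the complement of its test fold.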

The returned XXX_idx values are lists of indices that can be used directly by the built-in estimators.